---
title: Data Quality Assessment
description: The Data Quality Assessment automatically detects and often handles data quality issues such as outliers, leading or trailing zeros, target leakage, and many more.

---

# Data Quality Assessment {: #data-quality-assessment }

The Data Quality Assessment capability automatically detects and surfaces common data quality issues and often handles them with minimal or no action on your part. The assessment not only saves time finding and addressing issues, but also provides transparency into the automated processing that DataRobot applies. It includes a warning level to help determine issue severity.

See the associated [considerations](#feature-considerations) for important additional information.

As part of [EDA1](eda-explained#eda1), DataRobot runs checks on features that don’t require date/time and/or target information. Once EDA2 starts, DataRobot runs additional checks. In total, the following checks are run:

* [Outliers](quality-check#outliers)
* [Multicategorical format errors](quality-check#multicategorical-format-errors)
* [Inliers](quality-check#inliers)
* [Excess zeros](quality-check#excess-zeros)
* [Disguised missing values](quality-check#disguised-missing-values)
* [Target leakage](quality-check#target-leakage)
* [Missing images](quality-check#missing-images) (Visual AI projects)

Time series projects run all the baseline data quality checks as well as checks for:

* [Imputation leakage](quality-check#imputation-leakage)
* [Pre-derived lagged features](quality-check#pre-derived-lagged-feature)
* [Irregular time steps](quality-check#irregular-time-steps) (inconsistent gaps)
* [Leading or trailing zeros](quality-check#leading-or-trailing-zeros)
* [Infrequent negative values](quality-check#infrequent-negative-values)
* [New series in validation](quality-check#new-series-in-validation)

The [Visual AI project](visual-ai/index) Data Quality Assessment runs the same baseline checks and an additional missing image check:

* [Missing images](quality-check#missing-images)

Once EDA1 completes, the Data Quality Assessment appears just above the feature listing on the **Data** page.

![](images/dq-1.png)

Beyond the baseline data quality assessment, DataRobot provides additional detail for time series and Visual AI projects. Once model building completes, you can view the [Data Quality Handling Report](dq-report) for more information on imputation.

## Overview {: #overview }

The Data Quality Assessment provides information about data quality issues that are relevant to your stage of model building. The assessment first runs as part of EDA1 (data ingest) and reports on the **All Features** list. It runs again and updates after EDA2, displaying information for the selected feature list (or, by default, **All Features**). For checks that are not applicable to individual features (for example, Inconsistent Gaps), the report provides a general summary. Click **View Info** to view (and then **Close Info** to dismiss) the report:

![](images/dq-3.png)

Each data quality check provides issue status flags, a short description of the issue, and a recommendation message, if appropriate:

* Warning (![](images/icon-warning.png)): Attention or action required

* Informational (![](images/icon-info-dq.png)): No action required

* No issue (![](images/icon-ok.png))

Because the results are based on the feature list, changing the selected feature list on the **Data** page can cause checks to appear in or disappear from the assessment. For example, if feature list `List 1` contains a feature `problem` that has outliers, the outliers check shows in the assessment. If you switch to `List 2`, which does not include `problem` (or any other feature with outliers), the outliers check reports "no issue" (![](images/icon-ok.png)).
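The dependence on the selected feature list can be pictured with a minimal Python sketch. It is illustrative only and does not represent DataRobot's implementation or any API: the `detected` mapping, the `feature_lists` dictionary, the extra feature names (`age`, `income`), and the `assessment_for` helper are all hypothetical.

```python
# Hypothetical detection results: feature name -> checks that feature triggered.
detected = {
    "problem": {"Outliers"},
    "age": set(),
    "income": {"Excess zeros"},
}

# Hypothetical feature lists; "List 2" omits the feature with outliers.
feature_lists = {
    "List 1": ["problem", "age", "income"],
    "List 2": ["age", "income"],
}


def assessment_for(feature_list, checks=("Outliers", "Excess zeros")):
    """Summarize each check using only the features in the given list."""
    summary = {}
    for check in checks:
        affected = [f for f in feature_list if check in detected.get(f, set())]
        summary[check] = "issue" if affected else "no issue"
    return summary


print(assessment_for(feature_lists["List 1"]))
print(assessment_for(feature_lists["List 2"]))
```

In this sketch, switching from `List 1` to `List 2` changes the outliers check from "issue" to "no issue" because no remaining feature triggered it, while the excess zeros check is unaffected.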

From within the assessment modal, you can filter by issue type to see which features triggered the checks. Toggle on **Show only affected features** and check boxes next to the check names to select which checks to display:

![](images/dq-2.png)

On the **Data** page, DataRobot then displays only the features in the selected feature list that violate the selected data quality checks. Hover on an icon for more detail:

![](images/dq-7.png)
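Conceptually, the **Show only affected features** filter keeps a feature when it belongs to the selected feature list and triggered at least one of the selected checks. The following Python sketch illustrates that rule under stated assumptions; the data and the `affected_features` function are hypothetical and are not part of DataRobot.

```python
def affected_features(detected, feature_list, selected_checks):
    """Return features from feature_list that triggered any selected check."""
    selected = set(selected_checks)
    # Keep a feature if its triggered checks overlap the selected checks.
    return [f for f in feature_list if detected.get(f, set()) & selected]


# Hypothetical detection results: feature name -> checks that feature triggered.
detected = {
    "problem": {"Outliers"},
    "income": {"Excess zeros", "Outliers"},
    "age": set(),
    "notes": {"Disguised missing values"},
}
current_list = ["problem", "income", "age"]  # "notes" is not in this list

print(affected_features(detected, current_list, ["Outliers"]))
print(affected_features(detected, current_list, ["Disguised missing values"]))
```

The second call returns an empty list: the only feature with disguised missing values is outside the selected feature list, so nothing is displayed for that check.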

For multilabel and Visual AI projects, **Preview Log** displays at the top if the assessment detects [multicategorical format errors](quality-check#multicategorical-format-errors) or [missing images](quality-check#missing-images) in the dataset. Click **Preview Log** to open a window with a detailed view of each error, so you can more easily find and fix them in the dataset.

![](images/data-quality-prev-1.png)

## Explore the assessment {: #explore-the-assessment }

Once EDA1 completes, and after optionally filtering the display, view the list of features affected by the issues you want to investigate. To see the values that triggered a warning or informational notification, expand a feature and review the **Histogram** and **Frequent Values** visualizations.

### Interpret the Histogram tab {: #interpret-the-histogram-tab }

{% include 'includes/histogram-include.md' %}

### Interpret Frequent Values {: #interpret-frequent-values }

The [Frequent Values](histogram#frequent-values-chart) chart, in addition to showing common values, reports inliers, disguised missing values, and excess zeros.

![](images/dq-9.png)

## Read more {: #read-more }

To learn more about the topics discussed on this page, see:

* [Detailed descriptions](quality-check) of each check.
* A [summary of the logic](quality-check#data-quality-check-logic-summary) behind each of the data quality checks.

## Feature considerations {: #feature-considerations }

Consider the following when working with the Data Quality Assessment capability:

* For disguised missing value, inlier, and excess zero issues, automated _handling_ is only enabled for linear and Keras blueprints, where it has proven to reduce model error. Detection is applied to all blueprints.
* You cannot disable automated imputation handling.
* A public API is not yet available.
* Automated feature engineering runs on raw data; excess zeros and disguised missing values are not removed before rolling averages are calculated.